Alison’s EDA before collaboration
First, read the three CSV files downloaded from the 2019 OECD study of violence against women. Each file covers one measure:
Attitudes toward violence: The percentage of women who agree that a husband/partner is justified in beating his wife/partner under certain circumstances
Prevalence of violence in the lifetime: The percentage of women who have experienced physical and/or sexual violence from an intimate partner at some time in their life
Laws on domestic violence: Whether the legal framework offers women legal protection from domestic violence. Values range from 0 to 1, where 0 means that laws or practices do not discriminate against women’s rights and 1 means that laws or practices fully discriminate against women’s rights.
Just by eyeballing the data, I can see that:
1. Not all countries are listed, and not all countries have a record for all three features.
2. The three features have different ranges (0-100% or 0-1). Strangely, Law takes only four values: 0.25, 0.5, 0.75 and 1. It looks more categorical than continuous.
3. The countries are abbreviated, so I need some way to match the codes with the names.
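The first two observations can also be checked in code rather than by eye. This is only a minimal sketch: it assumes each table has the LOCATION and Value columns used later in this notebook, and the toy rows (codes AAA to DDD and their values) are made up for illustration, not OECD figures.

```r
# Toy stand-ins for two of the tables (illustrative values only)
attitude_toy <- data.frame(LOCATION = c("AAA", "BBB", "CCC"),
                           Value = c(10.0, 25.5, 40.0))
law_toy <- data.frame(LOCATION = c("AAA", "BBB", "DDD"),
                      Value = c(0.25, 0.75, 0.75))

# Countries present in one table but missing from the other
only_in_attitude <- setdiff(attitude_toy$LOCATION, law_toy$LOCATION)
only_in_law <- setdiff(law_toy$LOCATION, attitude_toy$LOCATION)

# Distinct values: a handful of levels suggests a categorical score
law_levels <- sort(unique(law_toy$Value))

print(only_in_attitude)
print(only_in_law)
print(law_levels)
```

Running the same three checks on the real tables would show which countries drop out of which measure before any join is done.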
# Data source: https://www.iban.com/country-codes
country_code_df <- read.csv('./Data/country_code.csv')
So now I have a way to match the country codes to their English names.
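A minimal sketch of how that matching could work, using base R merge. The lookup column names Alpha3 and Name are hypothetical (I have not shown the real headers of country_code.csv here), and XXX is a made-up code that has no match.

```r
# Hypothetical lookup table; the real column names in country_code.csv may differ
codes_toy <- data.frame(Alpha3 = c("AUS", "CAN", "FIN"),
                        Name = c("Australia", "Canada", "Finland"))

# Toy data keyed by three-letter code ("XXX" is a made-up unmatched code)
values_toy <- data.frame(Country = c("AUS", "FIN", "XXX"),
                         Attitude = c(1, 2, 3))

# Left join: keep every data row, attach the name where the code matches
named <- merge(values_toy, codes_toy,
               by.x = "Country", by.y = "Alpha3", all.x = TRUE)
print(named)
```

Keeping all.x = TRUE means an unmatched code stays in the result with a missing name instead of silently disappearing, which makes lookup gaps easy to spot.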
I will combine everything into one data frame by taking the country code and value columns from each of the three data frames and full outer joining them on country.
library(dplyr)  # provides full_join()

# Keep only the country code and the measured value from each table
attitude_sub <- attitude_df[c('LOCATION', 'Value')]
colnames(attitude_sub) <- c('Country', 'Attitude')
law_sub <- law_df[c('LOCATION', 'Value')]
colnames(law_sub) <- c('Country', 'Law')
prevalence_sub <- prevalence_df[c('LOCATION', 'Value')]
colnames(prevalence_sub) <- c('Country', 'Prevalence')

# A full outer join keeps every country that appears in any of the three tables
df <- full_join(attitude_sub, law_sub, by = "Country")
df <- full_join(df, prevalence_sub, by = "Country")
head(df)
  Country Attitude  Law Prevalence
1     AUS      3.2 0.75       16.9
2     CAN      7.8 0.25        1.9
3     FIN     11.2 0.75       30.0
4     FRA      6.6 0.25       26.0
5     DEU     19.6 0.75       22.0
6     HUN      8.7 0.75       21.0
Looks good! But there must be many missing values. Let’s check.
summary(df)
   Country             Attitude          Law          Prevalence
 Length:163         Min.   : 0.00   Min.   :0.250   Min.   : 1.90
 Class :character   1st Qu.: 8.60   1st Qu.:0.500   1st Qu.:18.30
 Mode  :character   Median :22.05   Median :0.750   Median :24.60
                    Mean   :27.52   Mean   :0.592   Mean   :28.96
                    3rd Qu.:42.52   3rd Qu.:0.750   3rd Qu.:35.00
                    Max.   :92.10   Max.   :1.000   Max.   :85.00
                    NA's   :11                      NA's   :34
Indeed, of the 163 countries, Attitude has 11 missing values (NA) and Prevalence has 34, which leaves 152 and 129 countries for those two measures. Law has none.
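The same counts can be read off directly instead of scanning the summary. A minimal sketch on a toy data frame with the same column layout as df; the rows are invented for illustration.

```r
# Toy data frame with the same column layout as df (values are illustrative)
df_toy <- data.frame(Country = c("AAA", "BBB", "CCC", "DDD"),
                     Attitude = c(5.0, NA, 30.0, 12.5),
                     Law = c(0.25, 0.5, 0.75, 1),
                     Prevalence = c(NA, NA, 20.0, 35.0))

# Missing values per column
na_counts <- colSums(is.na(df_toy))

# Rows with no missing value in any column
n_complete <- sum(complete.cases(df_toy))

print(na_counts)
print(n_complete)
```

The complete-case count matters for any plot or statistic that uses two columns at once, because only rows with both values present can contribute.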
The data source website already offers some excellent interactive visualizations, but a few more plots are useful here.
df %>%
  ggplot(aes(x = Attitude)) +
  geom_histogram(binwidth = 2) +
  labs(
    title = 'Histogram of Attitudes toward violence',
    subtitle = '152 countries included',
    x = 'Attitudes',
    y = 'Frequency'
  )
df %>%
  ggplot(aes(x = Prevalence)) +
  geom_histogram(binwidth = 2) +
  labs(
    title = 'Histogram of Prevalence of violence in the lifetime',
    subtitle = '129 countries included',
    x = 'Prevalence',
    y = 'Frequency'
  )
df %>%
  ggplot(aes(x = Law)) +
  geom_bar() +
  scale_x_continuous(breaks = c(0.25, 0.5, 0.75, 1.00)) +
  labs(
    title = 'Bar chart of Laws on domestic violence',
    subtitle = '163 countries included',
    x = 'Law',
    y = 'Frequency'
  )
This one is not so informative because there are only 4 values.
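With only four values, a plain count per level says as much as the plot. A minimal sketch that treats the score as an ordered factor; the toy vector is made up, and the four levels are the ones observed above.

```r
# Toy Law scores (illustrative); the real column is df$Law
law_toy <- c(0.25, 0.75, 0.75, 0.5, 1, 0.75)

# Treat the score as an ordered category with the four observed levels
law_factor <- factor(law_toy, levels = c(0.25, 0.5, 0.75, 1), ordered = TRUE)

# Number of countries at each level
law_counts <- table(law_factor)
print(law_counts)
```

Converting to a factor is a design choice, not something the data source prescribes: it makes later plots group by level instead of placing the levels on a continuous axis.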
df %>%
  ggplot(aes(x = Attitude, y = Prevalence)) +
  geom_point() +
  labs(
    title = 'Scatterplot of Prevalence vs Attitude',
    x = 'Attitude',
    y = 'Prevalence'
  )
The points suggest a positive correlation between Attitude and Prevalence.
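A visual impression like this can be backed by a number. A minimal sketch using base R cor with pairwise-complete rows, since both columns contain missing values; the two toy vectors are invented and do not reproduce the OECD data.

```r
# Toy vectors with missing values in different positions (illustrative only)
att <- c(5, 10, NA, 30, 50)
prev <- c(12, 15, 20, NA, 40)

# Pearson correlation over the rows where both values are present
r <- cor(att, prev, use = "complete.obs")
print(r)
```

On the real data this would be cor(df$Attitude, df$Prevalence, use = "complete.obs"). Without the use argument, cor returns NA as soon as either column has a missing value.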
df %>%
  ggplot(aes(x = Law, y = Prevalence)) +
  geom_point() +
  labs(
    title = 'Scatterplot of Prevalence vs Law',
    x = 'Law',
    y = 'Prevalence'
  )
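Because Law takes only four values, the points in this plot stack into vertical strips, so a per-level summary may be easier to read than the scatter itself. A minimal sketch with base R tapply; the toy rows are made up, and using the median is my choice here, not something the source data dictates.

```r
# Toy rows with the same two columns (illustrative values only)
df_toy <- data.frame(Law = c(0.25, 0.25, 0.75, 0.75, 1),
                     Prevalence = c(10, 20, NA, 30, 40))

# Median Prevalence within each Law level, ignoring missing values
med_by_law <- tapply(df_toy$Prevalence, df_toy$Law, median, na.rm = TRUE)
print(med_by_law)
```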
df %>%
  ggplot(aes(x = Law, y = Attitude)) +
  geom_point() +
  labs(
    title = 'Scatterplot of Attitude vs Law',
    x = 'Law',
    y = 'Attitude'
  )
This looks fancy but is pretty useless. I cannot see much from this 3D plot.